Efficient Query Processing for SPARQL Federations with Replicated Fragments
Low reliability and availability of public SPARQL endpoints prevent real-world applications from exploiting the full potential of these querying infrastructures. Fragmenting data across servers can improve data availability but degrades performance. Replicating fragments offers a new trade-off between performance and availability. We propose FEDRA, a framework for querying Linked Data that takes advantage of client-side data replication and implements a source selection algorithm that aims to reduce the number of selected public SPARQL endpoints, the execution time, and the size of intermediate results. FEDRA has been implemented on the state-of-the-art query engines ANAPSID and FedX, and empirically evaluated on a variety of real-world datasets.
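To make the source selection objective concrete, here is a small Go sketch of a greedy, set-cover-style selection over replicated fragments. It only illustrates the goal stated above (cover all fragments a query needs with as few endpoints as possible); it is not FEDRA's actual algorithm, and all identifiers (selectEndpoints, replicas) are hypothetical.

// Illustrative sketch, not FEDRA's algorithm: greedily pick endpoints so
// that every fragment required by a query is covered by as few public
// SPARQL endpoints as possible.
package main

import "fmt"

// replicas maps each fragment id to the endpoints that replicate it.
func selectEndpoints(required []string, replicas map[string][]string) []string {
    uncovered := make(map[string]bool)
    for _, f := range required {
        uncovered[f] = true
    }
    var chosen []string
    for len(uncovered) > 0 {
        // Pick the endpoint covering the most still-uncovered fragments.
        best, bestGain := "", 0
        gain := map[string]int{}
        for f := range uncovered {
            for _, e := range replicas[f] {
                gain[e]++
                if gain[e] > bestGain {
                    best, bestGain = e, gain[e]
                }
            }
        }
        if bestGain == 0 {
            break // some fragment has no known replica
        }
        chosen = append(chosen, best)
        for f := range uncovered {
            for _, e := range replicas[f] {
                if e == best {
                    delete(uncovered, f)
                    break
                }
            }
        }
    }
    return chosen
}

func main() {
    replicas := map[string][]string{
        "f1": {"ep1", "ep2"},
        "f2": {"ep2"},
        "f3": {"ep2", "ep3"},
    }
    // ep2 alone replicates all three fragments, so one endpoint suffices.
    fmt.Println(selectEndpoints([]string{"f1", "f2", "f3"}, replicas))
}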
Comparing MapReduce and pipeline implementations for counting triangles
A common way to define a parallel solution for a computational problem is to apply the Divide & Conquer paradigm so that processors act on their own data and are scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Albeit used for the implementation of a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implement the Divide & Conquer paradigm, named pipeline. The main features of the pipeline are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting either when the input graph does not fit in memory or is dynamically generated. To evaluate the properties of the pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different sizes and densities. Observed results suggest that the pipeline allows for an efficient implementation of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation.
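To illustrate the pattern, the following is a minimal Go sketch of counting triangles with a dynamic pipeline of goroutines connected by channels. It is not the paper's implementation, and all identifiers (filter, hits, and so on) are illustrative. One filter stage is spawned per vertex the first time that vertex is seen; an edge closes a triangle at the stage of a vertex that is already adjacent to both of its endpoints, so each triangle is counted exactly once.

// A minimal sketch, not the paper's code: triangle counting with a
// dynamic pipeline. Stages are spawned on demand, so the chain's length
// adapts to the graph, and each stage stores only one neighborhood.
package main

import "fmt"

type edge struct{ u, v int }

// filter is one pipeline stage. It owns vertex id and the set of id's
// neighbors seen so far. An edge incident to id extends the neighbor set;
// an edge whose endpoints are both already neighbors of id closes a triangle.
func filter(id int, nbrs map[int]bool, in <-chan edge, out chan<- edge, hits chan<- int) {
    for e := range in {
        switch {
        case e.u == id:
            nbrs[e.v] = true
        case e.v == id:
            nbrs[e.u] = true
        case nbrs[e.u] && nbrs[e.v]:
            hits <- 1 // e closes a triangle whose third vertex is id
        }
        out <- e // forward every edge downstream
    }
    close(out)
}

func main() {
    // A simple graph (no self-loops or duplicate edges) with two
    // triangles: {1,2,3} and {2,3,4}.
    edges := []edge{{1, 2}, {1, 3}, {2, 3}, {2, 4}, {3, 4}}

    in := make(chan edge)
    hits := make(chan int)

    // Feed the edge stream.
    go func() {
        for _, e := range edges {
            in <- e
        }
        close(in)
    }()

    // Tail of the pipeline: spawn one filter per newly seen vertex, seeded
    // with the other endpoint of the spawning edge, then keep reading from
    // the new end of the chain.
    go func() {
        cur := (<-chan edge)(in)
        seen := map[int]bool{}
        for {
            e, ok := <-cur
            if !ok {
                close(hits)
                return
            }
            for _, p := range [2][2]int{{e.u, e.v}, {e.v, e.u}} {
                if v, other := p[0], p[1]; !seen[v] {
                    seen[v] = true
                    next := make(chan edge, 16)
                    go filter(v, map[int]bool{other: true}, cur, next, hits)
                    cur = next
                }
            }
        }
    }()

    total := 0
    for n := range hits {
        total += n
    }
    fmt.Println("triangles:", total) // prints: triangles: 2
}

Because stages are created only as new vertices arrive, the number of processing units adapts to the instance, and the graph never needs to fit in memory as a whole: each stage keeps just the adjacency of its own vertex.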
MapSDI: A Scaled-up Semantic Data Integration Framework for Knowledge Graph Creation
Semantic web technologies have contributed effective solutions to the problems of data integration and knowledge graph creation. However, with the rapid growth of big data in diverse domains, several interoperability issues still demand to be addressed, scalability being one of the main challenges. In this paper, we address the problem of knowledge graph creation at scale and present MapSDI, a mapping-rule-based framework for optimizing semantic data integration into knowledge graphs. MapSDI allows for the efficient semantic enrichment of large, heterogeneous, and potentially low-quality data. The input of MapSDI is a set of data sources and mapping rules expressed in a mapping language such as RML. First, MapSDI pre-processes the sources based on semantic information extracted from the mapping rules, applying basic database operators: it projects out required attributes, eliminates duplicates, and selects relevant entries. All these operators are defined on the basis of the knowledge encoded in the mapping rules, which is then used by the semantification engine (or RDFizer) to produce a knowledge graph. We have empirically studied the impact of MapSDI on existing RDFizers and observed that knowledge graph creation time can be reduced by one order of magnitude on average. We also show, theoretically, that the source and rule transformations performed by MapSDI are data-lossless.
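The pre-processing idea can be pictured with a short Go sketch (hypothetical types and names, not MapSDI's actual API): only the attributes referenced by the mapping rules are kept, incomplete entries are selected out, and duplicates are eliminated before the RDFizer runs.

// Illustrative sketch of mapping-driven pre-processing: projection,
// selection, and duplicate elimination over the attributes a rule uses.
package main

import (
    "fmt"
    "strings"
)

type row map[string]string

// preprocess keeps only the attributes referenced by a mapping rule,
// drops rows missing a required attribute, and removes duplicates, so the
// downstream RDFizer touches far less data.
func preprocess(rows []row, referenced []string) []row {
    seen := map[string]bool{}
    var out []row
    for _, r := range rows {
        proj := row{}
        complete := true
        for _, a := range referenced {
            v, has := r[a]
            if !has || v == "" {
                complete = false // selection: skip incomplete entries
                break
            }
            proj[a] = v
        }
        if !complete {
            continue
        }
        // Duplicate elimination on the projected attributes only.
        var key []string
        for _, a := range referenced {
            key = append(key, proj[a])
        }
        k := strings.Join(key, "\x1f")
        if !seen[k] {
            seen[k] = true
            out = append(out, proj)
        }
    }
    return out
}

func main() {
    rows := []row{
        {"id": "1", "name": "Ann", "note": "x"},
        {"id": "1", "name": "Ann", "note": "y"}, // duplicate after projection
        {"id": "2", "name": ""},                 // incomplete, selected out
    }
    fmt.Println(preprocess(rows, []string{"id", "name"}))
}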
Dynamic Pipeline: an adaptive solution for big data
The Dynamic Pipeline is a concurrent programming pattern amenable to parallelization. Furthermore, the number of processing units used in the parallelization is adjusted to the size of the problem, and each processing unit uses a reduced memory footprint. Contrary to other approaches, the Dynamic Pipeline can be seen as a generalization of the (parallel) Divide and Conquer schema, where systems can be reconfigured depending on the particular instance of the problem to be solved. We claim that the Dynamic Pipeline is useful for dealing with Big Data related problems. In particular, we have designed and implemented algorithms for computing graph parameters such as the number of triangles, connected components, and maximal cliques, among others. Currently, we are focused on designing and implementing an efficient algorithm to evaluate conjunctive queries.
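As a compact illustration of how the chain of processing units grows with the problem instance, here is the classic concurrent sieve written in this style in Go. It is only an illustration of the pattern, not an example from the paper: one stage is spawned per prime that reaches the tail, so the pipeline's length adapts to the instance and each stage holds a single integer.

// Schematic illustration of the Dynamic Pipeline pattern: the chain of
// stages grows as the data demands, one goroutine per prime found.
package main

import "fmt"

// stage filters out multiples of its prime and forwards the rest.
func stage(prime int, in <-chan int, out chan<- int) {
    for n := range in {
        if n%prime != 0 {
            out <- n
        }
    }
    close(out)
}

func main() {
    const limit = 50
    nums := make(chan int)
    go func() {
        for n := 2; n <= limit; n++ {
            nums <- n
        }
        close(nums)
    }()

    // Tail: every number that survives all existing stages is prime, so a
    // new stage is spawned for it and the tail re-attaches behind it.
    cur := (<-chan int)(nums)
    for {
        p, ok := <-cur
        if !ok {
            break
        }
        fmt.Print(p, " ")
        next := make(chan int)
        go stage(p, cur, next)
        cur = next
    }
    fmt.Println()
}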
Comparing MapReduce and pipeline implementations for counting triangles
A common method to define a parallel solution for a computational problem consists in finding a way to use the Divide and Conquer paradigm so that processors act on their own data and are scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Albeit used for the implementation of a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implement the Divide and Conquer paradigm, named dynamic pipeline. The main features of dynamic pipelines are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting either when the input graph does not fit in memory or is dynamically generated. To evaluate the properties of the pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different topologies, sizes, and densities. Observed results suggest that dynamic pipelines allow for an efficient implementation of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation.
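For contrast with the dynamic pipeline sketched earlier, the following is a sequential Go sketch of the MapReduce-style logic for the same problem. It is hypothetical code, not the paper's ad-hoc implementation: round 1 groups edges by vertex and emits wedges (paths of length two), round 2 joins wedges against the edge set, and each triangle is found exactly three times, once per wedge center.

// Sequential sketch of two MapReduce rounds for triangle counting:
// wedge generation (round 1) and wedge closure (round 2).
package main

import "fmt"

type edge struct{ u, v int }

// key normalizes an unordered vertex pair so {a,b} has a single map key.
func key(a, b int) edge {
    if a > b {
        a, b = b, a
    }
    return edge{a, b}
}

func main() {
    edges := []edge{{1, 2}, {1, 3}, {2, 3}, {2, 4}, {3, 4}}

    // Round 1 "map" + shuffle: send each edge to both of its endpoints.
    adj := map[int][]int{}
    for _, e := range edges {
        adj[e.u] = append(adj[e.u], e.v)
        adj[e.v] = append(adj[e.v], e.u)
    }

    // Round 1 "reduce": every pair of neighbors of a vertex is a wedge.
    wedges := map[edge]int{} // closing pair -> number of wedges it would close
    for _, ns := range adj {
        for i := 0; i < len(ns); i++ {
            for j := i + 1; j < len(ns); j++ {
                wedges[key(ns[i], ns[j])]++
            }
        }
    }

    // Round 2 "join": a wedge closed by an existing edge is a triangle.
    edgeSet := map[edge]bool{}
    for _, e := range edges {
        edgeSet[key(e.u, e.v)] = true
    }
    count := 0
    for pair, n := range wedges {
        if edgeSet[pair] {
            count += n
        }
    }
    fmt.Println("triangles:", count/3) // prints: triangles: 2
}

Note the replication effect the abstract refers to: round 1 materializes every wedge, which for dense graphs is far more intermediate data than the input itself, whereas the pipeline forwards each edge once.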
Optimizing Federated Queries Based on the Physical Design of a Data Lake
The optimization of query execution plans is known to be crucial for reducing query execution time. In particular, query optimization has been studied thoroughly for relational databases over the past decades. Recently, the Resource Description Framework (RDF) has become popular for publishing data on the Web. As a consequence, federations composed of different data models, like RDF and relational databases, have evolved. One type of such federation is the Semantic Data Lake, where every data source is kept in its original data model and semantically annotated with ontologies or controlled vocabularies. However, state-of-the-art query engines for federated query processing over Semantic Data Lakes often rely on optimization techniques tailored for RDF. In this paper, we present query optimization techniques guided by heuristics that take the physical design of a Data Lake into account. The heuristics are implemented on top of Ontario, a SPARQL query engine for Semantic Data Lakes. Using source-specific heuristics, the query engine is able to generate more efficient query execution plans by exploiting knowledge about indexes and normalization in relational databases. We show that heuristics that take the physical design of the Data Lake into account are able to speed up query processing.
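The flavor of such a source-specific heuristic can be pictured with a tiny Go sketch (hypothetical types, not Ontario's actual planner): a selection is pushed down to a source only when that source's physical design can exploit it.

// Illustrative sketch of one physical-design heuristic: push a selection
// down to a relational source when the filtered attribute is indexed there.
package main

import "fmt"

type Source struct {
    Name    string
    Model   string          // "rdb", "rdf", "csv", ...
    Indexed map[string]bool // attributes covered by an index
}

type Filter struct{ Attr, Value string }

// placeFilter decides where a filter should be evaluated. Pushing the
// selection into an indexed relational source shrinks the intermediate
// results shipped to the federation engine.
func placeFilter(s Source, f Filter) string {
    if s.Model == "rdb" && s.Indexed[f.Attr] {
        return "push down to " + s.Name
    }
    return "evaluate at the engine"
}

func main() {
    db := Source{Name: "patients_db", Model: "rdb", Indexed: map[string]bool{"patient_id": true}}
    csv := Source{Name: "mutations_csv", Model: "csv"}
    fmt.Println(placeFilter(db, Filter{"patient_id", "42"}))
    fmt.Println(placeFilter(csv, Filter{"gene", "TP53"}))
}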
Dragoman: Efficiently Evaluating Declarative Mapping Languages over Frameworks for Knowledge Graph Creation
In recent years, there have been valuable efforts and contributions to make the process of RDF knowledge graph creation traceable and transparent; extending and applying declarative mapping languages is an example. One challenging step is the traceability of procedures that aim to overcome interoperability issues, a.k.a. data-level integration. In most pipelines, data integration is performed by ad-hoc programs, preventing traceability and reusability. However, formal frameworks provided by function-based declarative mapping languages such as FunUL and RML+FnO offer the required expressiveness: data-level integration can be defined as functions and integrated as part of the mappings performing schema-level integration. However, combining functions with the mappings introduces a new source of complexity that can considerably impact the required resources and execution time. We tackle the problem of efficiently executing mappings with functions and formalize their transformation into function-free mappings. These transformations are the basis of an optimization process that aims to perform an eager evaluation of function-based mapping rules. These techniques are implemented in a framework named Dragoman. We demonstrate the correctness of the transformations while ensuring that the function-free data integration processes are equivalent to the original ones. The effectiveness of Dragoman is empirically evaluated in 230 testbeds composed of various types of functions integrated with mapping rules of different complexity. The outcomes suggest that evaluating function-free mapping rules reduces execution time in complex knowledge graph creation pipelines composed of large data sources and multiple types of mapping rules. The savings can be up to 75%, suggesting that eagerly executing functions in mapping rules makes these pipelines applicable and scalable in real-world settings.
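One way to picture the eager evaluation that these transformations enable is the following Go sketch (hypothetical names, not Dragoman's API): the function is applied once per distinct input value before the mapping rules run, its results are materialized as a plain attribute, and the rewritten rule refers to that attribute instead of a function call, i.e., it becomes function-free.

// Illustrative sketch of eager function evaluation: materialize fn's
// results as a new attribute so the mapping engine never invokes fn.
package main

import (
    "fmt"
    "strings"
)

type row map[string]string

// materialize evaluates fn eagerly over the distinct values of attr and
// stores the result under outAttr for every row.
func materialize(rows []row, attr, outAttr string, fn func(string) string) {
    cache := map[string]string{} // one call per distinct input value
    for _, r := range rows {
        v := r[attr]
        if _, ok := cache[v]; !ok {
            cache[v] = fn(v)
        }
        r[outAttr] = cache[v]
    }
}

func main() {
    rows := []row{
        {"name": " Ann "},
        {"name": " Ann "}, // duplicate: fn runs only once for this value
        {"name": "Bob"},
    }
    // Example data-level integration function: value normalization.
    materialize(rows, "name", "name_norm", func(s string) string {
        return strings.ToLower(strings.TrimSpace(s))
    })
    for _, r := range rows {
        fmt.Println(r["name_norm"])
    }
}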
- …